Yummy: A Recipe Reviews Sentiment Analysis

Data Mining Project

Andrea Turconi - Mat: 730225

The aim of this Jupyter notebook is to provide a full explanation of a sentiment analysis performed on the "food.com" dataset. Since the dataset provides user ratings, it is possible to train supervised models. Furthermore, the notebook reports the results of different algorithms and models applied to the same dataset and compares them with some state-of-the-art models.

Sentiment Analysis, or Opinion Mining, is a sub-field of Natural Language Processing (NLP) that tries to identify and extract opinions within a given text. The aim of sentiment analysis is to gauge the attitude, sentiments, evaluations and emotions of a speaker or writer based on the computational treatment of subjectivity in a text.

Exploratory dataset analysis

The provided dataset is available at:

This dataset consists of 180K+ recipes and 700K+ recipe reviews covering 18 years of user interactions and uploads on Food.com (formerly GeniusKitchen).

The dataset provides two files, called RAW_recipes and RAW_interactions, that represent respectively the recipes, along with all their information, and the reviews provided by different users. Since the aim of this project is a sentiment analysis based on the reviews, it is possible to use only the RAW_interactions dataset.

The analyzed dataset is the one that contains all the review information. As shown in the table below, a review is composed of:

From the information below it is possible to get a clear overview of the complete dataset: it contains 1132367 reviews, written by 226570 unique users and covering 231637 unique recipes.

The dataset also contains null values (169). It is necessary to investigate in which of the previously described columns the NaN values are present. From the following information it emerges that all the NaN values are contained in the review column. Since this column is essential for the scope of this project, the reviews without text are dropped.
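As a sketch of this step (using a toy dataframe standing in for RAW_interactions), the NaN check and drop could look like this in pandas:

```python
import pandas as pd

# Toy stand-in for the RAW_interactions dataframe; in the real dataset
# the 169 NaN values all sit in the "review" column.
df = pd.DataFrame({
    "rating": [5, 4, 1],
    "review": ["Great recipe!", None, "Too salty."],
})

print(df.isna().sum())                               # NaN count per column
df = df.dropna(subset=["review"]).reset_index(drop=True)
print(len(df))                                       # rows kept
```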

Then, it is possible to compute the distribution of the reviews based on the number of words: from the following graph it can be seen that most reviews contain around 30 words (31 words for 17196 reviews, 33 for 17124 reviews and 30 for 17043 reviews).

To simplify the process, the cleaned dataset is saved to a CSV file, so that it can subsequently be loaded by the model.

Sentiment Analysis

Data Analysis

From the following graph it is possible to analyse the rating percentages of the user reviews. It can be seen that most of the customers' ratings are positive. Hence, it is reasonable to expect that most of the reviews will also be fairly positive.

Classifying reviews

In this step, the reviews are classified as positive or negative. This classification makes it possible to generate the training set for the model.

Positive reviews will be labelled +1 and negative reviews -1. Concretely, all reviews with "rating" > 3 are classified as +1, indicating that they are positive, and all reviews with "rating" < 3 are classified as -1. Reviews with "rating" = 3 are dropped, because they are neutral.
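This labelling rule can be sketched in pandas as follows (with a toy rating column in place of the real one):

```python
import pandas as pd

# Toy ratings; the real ones come from the RAW_interactions "rating" column.
df = pd.DataFrame({"rating": [5, 4, 3, 2, 1]})

df = df[df["rating"] != 3].copy()                    # drop neutral ratings
df["sentiment"] = df["rating"].apply(lambda r: 1 if r > 3 else -1)
print(df)
```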

Analysis on classification

Once all the reviews are classified with a sentiment, +1 for positive reviews and -1 for negative ones, it is possible to explore and analyse the classified data. First of all, the data are divided into two different dataframes: one with all the positive reviews, and the other with all the negative reviews.

After removing the neutral reviews, from the information below it is possible to say that:

From the following histogram it is possible to see the distribution of review sentiment across the dataset.

Data cleaning for the model

In order to feed the model, the data need to be cleaned and preprocessed: first all punctuation is removed from the data, then all the stopwords.
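A minimal sketch of this cleaning step, using only the standard library and an illustrative stopword subset (a real run would use a full list, e.g. NLTK's):

```python
import string

# Illustrative stopword subset; a real run would use a full English list.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "i"}

def clean(text):
    # Strip punctuation, lowercase, then drop stopwords.
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean("This is a great recipe, and I loved it!"))  # -> great recipe loved
```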

Logistic Regression - Building the model

It is now possible to build the sentiment model. The model takes reviews as input and predicts whether each review is positive or negative. Since this is a classification problem, a simple logistic regression model can be trained.

Logistic regression is similar to linear regression, but is used when the dependent variable is not a number but a category (e.g., a "yes/no" response). Although it is called regression, it performs classification: based on the regression output it assigns the dependent variable to one of the classes.

Then the dataset is split into two dataframes: the first 80% of the data for the training set, the remaining 20% for the test set.

Since logistic regression cannot work directly with text, it is necessary to convert all the text values into a bag-of-words representation. Hence, the dataframe is transformed into a bag-of-words model, which is a sparse matrix of integer counts.

Now it is possible to apply logistic regression to the training data to build the model.
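The vectorise-then-fit pipeline can be sketched with scikit-learn on a tiny labelled corpus standing in for the real 80% training split:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny labelled corpus standing in for the real training split.
train_texts = ["loved this recipe delicious", "great easy and tasty",
               "awful bland waste of time", "terrible too salty"]
train_labels = [1, 1, -1, -1]

vectorizer = CountVectorizer()               # bag of words: sparse count matrix
X_train = vectorizer.fit_transform(train_texts)

model = LogisticRegression()
model.fit(X_train, train_labels)

# Predict the sentiment of an unseen review.
pred = model.predict(vectorizer.transform(["delicious and tasty"]))
print(pred)
```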

Evaluate the model

In order to evaluate how well the model predicts the data, it is possible to compute a classification report considering precision, recall and F1-score. Hence, the following steps describe how to compute a confusion matrix for the model and how to derive accuracy measures from the matrix.
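With scikit-learn, these metrics come from three ready-made functions (shown here on toy labels rather than the real test split):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Toy true labels and predictions; the real ones come from the test split.
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, -1, 1]

print(confusion_matrix(y_true, y_pred))      # rows: true class, cols: predicted
print(classification_report(y_true, y_pred)) # precision, recall, F1 per class
print(accuracy_score(y_true, y_pred))        # 4 of 6 correct
```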

The overall accuracy of the model on the test data is around 94%, which is pretty good. But, as the classification report shows, the precision, recall and F1-score for the negative reviews are very low. This is likely due to an unbalanced training set, which is also suggested by the precision of the positive reviews being nearly 100%. It is therefore worth balancing the training dataset.

Balancing class

From the plots in the section above and the classification results it is clear that the data are unbalanced, so it is possible to take a subset of the dataset in order to have the same amount of data for both classes.
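Downsampling the majority class to the size of the minority class might look like this in pandas (toy dataframe and seed are illustrative):

```python
import pandas as pd

# Toy unbalanced frame: 8 positive rows, 2 negative rows.
df = pd.DataFrame({
    "sentiment": [1] * 8 + [-1] * 2,
    "review": [f"review {i}" for i in range(10)],
})

# Downsample the majority class to the size of the minority class.
n_neg = (df["sentiment"] == -1).sum()
pos_sample = df[df["sentiment"] == 1].sample(n=n_neg, random_state=42)
balanced = pd.concat([pos_sample, df[df["sentiment"] == -1]])
print(balanced["sentiment"].value_counts())
```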

From the following plot it is possible to see that the training dataset is now balanced: it contains the same amount of positive and negative reviews.

Tools for building the model

Logistic Regression

After balancing the dataset it is possible to rebuild the same logistic regression model presented before.

Evaluate the model

From the results it is possible to see that the accuracy of the model has decreased to 76%. This is likely an effect of balancing the data.

Naive Bayes Classifier - Building the model

It is now possible to apply other classification algorithms to the balanced data and compare them with the logistic regression model. The first is a Naive Bayes classifier: a classification algorithm that relies on Bayes' theorem. This theorem provides a way of calculating a quantity called the posterior probability, in which the probability of an event A occurring depends on known probabilistic background (e.g., evidence from an event B).
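Swapping the classifier is a one-line change in scikit-learn; a sketch on the same kind of toy corpus as before:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus; MultinomialNB models the bag-of-words counts directly.
texts = ["loved it delicious meal", "great tasty and easy",
         "awful bland waste", "terrible disgusting mess"]
labels = [1, 1, -1, -1]

vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)         # posterior via Bayes' theorem
pred = clf.predict(vec.transform(["delicious and tasty"]))
print(pred)
```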

The accuracy of Multinomial Naive Bayes is 75%, very close to the logistic regression.

Support Vector Machine (SVM) - Building the model

Again, it is possible to apply another classification algorithm: Support Vector Machines. SVMs can solve both classification and regression problems and easily handle multiple continuous and categorical variables. An SVM constructs a hyperplane in a multidimensional space to separate the different classes; the optimal hyperplane is generated in an iterative manner by minimizing an error. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.
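For text classification a linear SVM is the usual choice; a sketch with scikit-learn's LinearSVC on the same toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

texts = ["loved it delicious meal", "great tasty and easy",
         "awful bland waste", "terrible disgusting mess"]
labels = [1, 1, -1, -1]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# LinearSVC fits the maximum-margin separating hyperplane.
clf = LinearSVC().fit(X, labels)
pred = clf.predict(vec.transform(["delicious and tasty"]))
print(pred)
```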

The accuracy of the SVM is 76%, very close to the logistic regression and the Naive Bayes classifier.

Using TF-IDF

It is possible to try using TF-IDF instead of CountVectorizer to embed the words, and to retrain the algorithms presented in the sections above.

From the above results it can be seen that the accuracy using TF-IDF is neither better nor worse than the accuracy with CountVectorizer.

Normalize the dataset

In order to improve the accuracy of the models it is possible to clean the dataset in a more thorough way. The data will be normalized using two methods: stemming and lemmatization. Normalization is a common step in text preprocessing; with this technique, different forms of a word are converted into a single one. Stemming is considered the cruder, brute-force approach to normalization: it uses basic rules to chop off the ends of words. Lemmatization works by identifying the part of speech of a given word and then applying more complex rules to transform the word into its true root.
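A deliberately crude, hand-rolled suffix chopper can illustrate what Porter-style stemming does (a real run would use NLTK's PorterStemmer and WordNetLemmatizer instead):

```python
# Hand-rolled suffix chopper sketching what Porter-style stemming does;
# a real project would use nltk.stem.PorterStemmer / WordNetLemmatizer.
def crude_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["cooking", "cooked", "recipes", "bakes"]])
# -> ['cook', 'cook', 'recip', 'bak']
```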

Train the model

In order to test the models and see whether normalization improves the classification, it is possible to rebuild the previous algorithms with the normalized dataset, using one of the two bag-of-words models presented above.

As can be seen from the above results, the accuracies are very close (some identical) to those obtained in the previous section of the project.

Using N-Grams

N-grams are sequences of two or three words (bigrams or trigrams). With the implementation of N-grams (n-word associations) the model can potentially be more predictive. For example, if a review contained the three-word sequence "didn't love recipe", a unigram-only model would consider these words individually and probably fail to capture that this is actually a negative sentiment, because the word "love" by itself is highly correlated with a positive review.

As can be seen from the above results, the accuracy of the Multinomial Naive Bayes classifier is slightly better (75% instead of 73%), the logistic regression accuracy also increases (77% instead of 76%), and the accuracy of the other algorithm (linear Support Vector Machine) is unchanged.

Using Vader

In order to compare the previous models, it is possible to run a sentiment analysis with a standalone sentiment analysis tool. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER uses a sentiment lexicon: a list of lexical features (e.g., words) that are generally labelled according to their semantic orientation as either positive or negative. VADER has been found to be quite successful when dealing with social media texts, NY Times editorials, movie reviews and product reviews. This is because VADER not only reports positivity and negativity scores but also tells us how positive or negative a sentiment is.

The cells above show the use of VADER sentiment analysis. Given a sentence, it returns a dictionary of negative, neutral, positive and compound values. The positive, negative and neutral scores represent the proportion of the text that falls into each category. For example, in the cell above the sentence was rated as 57% positive, 42% neutral and 0% negative, which sums up to a positive classification. The compound score is a metric that combines all the lexicon ratings and is normalized between -1 (most extreme negative) and +1 (most extreme positive). In the case above, the compound score turns out to be 0.77, denoting a very high positive sentiment.

It is now possible to apply the VADER sentiment analysis to the normalized, cleaned and balanced data in order to test its accuracy.

Now, every review has a computed VADER sentiment. In order to test the accuracy and precision of VADER, the VADER sentiment needs to be mapped to +1 (positive review) or -1 (negative review). As said before, reviews with a VADER sentiment greater than 0 are classified as positive, while reviews with a VADER sentiment less than 0 are classified as negative.
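The thresholding step can be sketched as a tiny pure-Python function (the handling of a compound score of exactly 0 is not specified above, so it is arbitrarily treated as negative here):

```python
# Map a VADER compound score in [-1, 1] to the +1/-1 labels used above.
# A compound of exactly 0 is not covered by the rule in the text; it is
# arbitrarily treated as negative in this sketch.
def vader_to_label(compound):
    return 1 if compound > 0 else -1

print([vader_to_label(c) for c in [0.77, -0.4, 0.01]])  # -> [1, -1, 1]
```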

FastText

In order to have an evaluation of the models presented before it is possible to compare with a state-of-art model presented in the following paper:

The paper presents "FastText": a library for efficient learning of word representations and sentence classification. Its goal is to provide word embeddings and text classification efficiently. According to its authors, it is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation.

The core of FastText relies on the Continuous Bag of Words (CBOW) model for word representation and a hierarchical classifier to speed up training. CBOW is a shallow neural network trained to predict a word from its neighbors; FastText replaces the objective of predicting a word with predicting a category. These single-layer models train incredibly fast and scale very well. Also, fastText replaces the softmax over labels with a hierarchical softmax, in which each node represents a label. This reduces computation, as we do not need to compute the probabilities of all labels, and the limited number of parameters reduces training time.

FastText needs labeled data to train the supervised classifier. Labels must start with the prefix "__label__", which is how fastText distinguishes a label from a word, followed here by "NEGATIVE" or "POSITIVE" according to how the review is classified. Hence, it is possible to create a new column in the dataset with the reviews written in this format.

FastText also needs two files: a training file with all the labelled training reviews, and a test file with all the labelled test reviews. Hence, it is possible to split the dataset into train and test with the same proportions as the previous experiments (80% train, 20% test) and save them into two ".txt" files.
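Building such a file can be sketched as follows, assuming a list of (sentiment, review) pairs; the variable and file names are illustrative:

```python
# Sketch of building a fastText-format training file; fastText expects
# each line to start with the "__label__" prefix followed by the label.
rows = [(1, "loved this recipe"), (-1, "too salty for me")]

lines = [
    f"__label__{'POSITIVE' if s == 1 else 'NEGATIVE'} {text}"
    for s, text in rows
]
with open("train.txt", "w") as f:
    f.write("\n".join(lines))
print(lines)
```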

Train FastText

FastText can be trained with its predefined function "train_supervised". For this experiment it is possible to focus on the following arguments of the method:

First of all, since in the previous experiments the models were trained without N-grams, FastText is first run without N-grams in order to compare them. Then a bigram setting (2-grams) is added in order to check whether the model improves.

Evaluate FastText

From the following code it is possible to evaluate the two FastText models: the one without N-grams has an accuracy of 76%, while the one with bigrams reaches 77%.

Results

In order to have a clear evaluation of all the models and results, it is possible to compare all of them in a table. The best results so far are obtained with a normalized dataset using TF-IDF and N-grams with the logistic regression (77%). But it must be said that the results of all the models are very close to each other. VADER is the only model that produced poor results (59%). This is likely due to the normalization, since this library works best with raw data, including punctuation and capitalization. It is also possible to state that the state-of-the-art "FastText" paper produced results very similar to those produced by all the other models.

The first logistic regression applied, with an accuracy of 94%, is the one that uses the unbalanced dataset. From the precision and recall results it is possible to state that the accuracy of this model is heavily influenced by the classification of positive reviews, since they were almost 80% of the entire dataset.

In conclusion, it is possible to state that despite the application of various models and differently balanced datasets, the accuracy on this dataset is always around 76%.

Furthermore, the analysis need not end here: future developments could investigate whether there is a correlation between the number of words, the number of ingredients and the recipe instructions on the one hand, and the user rating on the other.